I. Reading and processing the New York Times (NYT) state-level COVID-19 data
1. Read in the Data
library(plotly)
Loading required package: ggplot2
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(here)
here() starts at /Users/CatherineLe/Desktop
library(zoo)
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
## data extracted from New York Times state-level data from NYT Github repository# https://github.com/nytimes/covid-19-data## state-level population information from us_census_data available on GitHub repository:# https://github.com/COVID19Tracking/associated-data/tree/master/us_census_data# load COVID state-level data from NYTcv_states <-as.data.frame(read.csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"))# load state population datastate_pops <-as.data.frame(read.csv("https://raw.githubusercontent.com/COVID19Tracking/associated-data/master/us_census_data/us_census_2018_population_estimates_states.csv"))state_pops$abb <- state_pops$statestate_pops$state <- state_pops$state_namestate_pops$state_name <-NULL# merging the 2 datasetscv_states <-merge(cv_states, state_pops, by="state")
2. Look at the Data
dim(cv_states)
[1] 58094 9
head(cv_states)
state date fips cases deaths geo_id population pop_density abb
1 Alabama 2023-01-04 1 1587224 21263 1 4887871 96.50939 AL
2 Alabama 2020-04-25 1 6213 213 1 4887871 96.50939 AL
3 Alabama 2023-02-26 1 1638348 21400 1 4887871 96.50939 AL
4 Alabama 2022-12-03 1 1549285 21129 1 4887871 96.50939 AL
5 Alabama 2020-05-06 1 8691 343 1 4887871 96.50939 AL
6 Alabama 2021-04-21 1 524367 10807 1 4887871 96.50939 AL
The format of the variables are not completely correct. The “date” variable is a character variable when it should be a date variable. Additionally, the “state” variable is a character variable when it should be a factor variable.
3. Format the Data
# format the datecv_states$date <-as.Date(cv_states$date, format="%Y-%m-%d")# format the state and state abbreviation (abb) variablesstate_list <-unique(cv_states$state)cv_states$state <-factor(cv_states$state, levels = state_list)abb_list <-unique(cv_states$abb)cv_states$abb <-factor(cv_states$abb, levels = abb_list)### FINISH THE CODE HERE # order the data first by state, second by datecv_states = cv_states[order(cv_states$state, cv_states$date),]
# Confirm the variables are now correctly formattedstr(cv_states)
The date and state variable are now correctly formatted.
# Inspect the range values for each variable. What is the date range? The range of cases and deaths?head(cv_states)
state date fips cases deaths geo_id population pop_density abb
1029 Alabama 2020-03-13 1 6 0 1 4887871 96.50939 AL
597 Alabama 2020-03-14 1 12 0 1 4887871 96.50939 AL
282 Alabama 2020-03-15 1 23 0 1 4887871 96.50939 AL
12 Alabama 2020-03-16 1 29 0 1 4887871 96.50939 AL
266 Alabama 2020-03-17 1 39 0 1 4887871 96.50939 AL
78 Alabama 2020-03-18 1 51 0 1 4887871 96.50939 AL
summary(cv_states)
state date fips cases
Washington : 1158 Min. :2020-01-21 Min. : 1.00 Min. : 1
Illinois : 1155 1st Qu.:2020-12-06 1st Qu.:16.00 1st Qu.: 112125
California : 1154 Median :2021-09-11 Median :29.00 Median : 418120
Arizona : 1153 Mean :2021-09-10 Mean :29.78 Mean : 947941
Massachusetts: 1147 3rd Qu.:2022-06-17 3rd Qu.:44.00 3rd Qu.: 1134318
Wisconsin : 1143 Max. :2023-03-23 Max. :72.00 Max. :12169158
(Other) :51184
deaths geo_id population pop_density
Min. : 0 Min. : 1.00 Min. : 577737 Min. : 1.292
1st Qu.: 1598 1st Qu.:16.00 1st Qu.: 1805832 1st Qu.: 43.659
Median : 5901 Median :29.00 Median : 4468402 Median : 107.860
Mean : 12553 Mean :29.78 Mean : 6397965 Mean : 423.031
3rd Qu.: 15952 3rd Qu.:44.00 3rd Qu.: 7535591 3rd Qu.: 229.511
Max. :104277 Max. :72.00 Max. :39557045 Max. :11490.120
NA's :1106
abb
WA : 1158
IL : 1155
CA : 1154
AZ : 1153
MA : 1147
WI : 1143
(Other):51184
min(cv_states$date)
[1] "2020-01-21"
max(cv_states$date)
[1] "2023-03-23"
The date range is early January of 2021 to March of 2023. The range of cases is 1 to more than one million. The range of deaths is 0 to more than 100,000.
4. Add new_cases and new_deaths and correct outliers
for (i in1:length(state_list)) { cv_subset =subset(cv_states, state == state_list[i]) cv_subset = cv_subset[order(cv_subset$date),]# add starting level for new cases and deaths cv_subset$new_cases = cv_subset$cases[1] cv_subset$new_deaths = cv_subset$deaths[1]### FINISH THE CODE HEREfor (j in2:nrow(cv_subset)) { cv_subset$new_cases[j] = cv_subset$cases[j] - cv_subset$cases[j -1] cv_subset$new_deaths[j] = cv_subset$cases[j] - cv_subset$deaths[j -1] }# include in main dataset cv_states$new_cases[cv_states$state==state_list[i]] = cv_subset$new_cases cv_states$new_deaths[cv_states$state==state_list[i]] = cv_subset$new_deaths}
# Focus on recent datescv_states <- cv_states %>% dplyr::filter(date >="2021-06-01")
### FINISH THE CODE HERE# Inspect outliers in new_cases using plotlyp1<-ggplot(cv_states, aes(x = date, y = new_cases, color = state)) +geom_hline(yintercept =0, linetype ="dashed", color ="darkred") +geom_point(size = .5, alpha =0.5)ggplotly(p1)
p1<-NULL# to clear from workspacep2<-ggplot(cv_states, aes(x = date, y = new_deaths, color = state)) +geom_hline(yintercept =0, linetype ="dashed", color ="darkred") +geom_point(size = .5, alpha =0.5)ggplotly(p2)
p2<-NULL# to clear from workspace
I added a dark red line at the y-axis to see if there are any negative values for new_cases and new_deaths.
# Recalculate `cases` and `deaths` as cumulative sum of updated `new_cases` and `new_deaths`for (i in1:length(state_list)) { cv_subset =subset(cv_states, state == state_list[i])# add starting level for new cases and deaths cv_subset$cases = cv_subset$cases[1] cv_subset$deaths = cv_subset$deaths[1]### FINISH CODE HEREfor (j in2:nrow(cv_subset)) { cv_subset$cases[j] = cv_subset$new_cases[j] + cv_subset$cases[j -1] cv_subset$deaths[j] = cv_subset$new_deaths[j] + cv_subset$deaths[j -1] }# include in main dataset cv_states$cases[cv_states$state==state_list[i]] = cv_subset$cases cv_states$deaths[cv_states$state==state_list[i]] = cv_subset$deaths}
Warning in cv_subset$new_deaths[j] + cv_subset$deaths[j - 1]: NAs produced by
integer overflow
Warning in cv_subset$new_deaths[j] + cv_subset$deaths[j - 1]: NAs produced by
integer overflow
Warning in cv_subset$new_deaths[j] + cv_subset$deaths[j - 1]: NAs produced by
integer overflow
Warning in cv_subset$new_deaths[j] + cv_subset$deaths[j - 1]: NAs produced by
integer overflow
# Smooth new countscv_states$new_cases = zoo::rollmean(cv_states$new_cases, k=7, fill=NA, align='right') %>%round(digits =0)cv_states$new_deaths = zoo::rollmean(cv_states$new_deaths, k=7, fill=NA, align='right') %>%round(digits =0)# Inspect data again interactivelyp2<-ggplot(cv_states, aes(x = date, y = new_deaths, color = state)) +geom_line() +geom_point(size = .5, alpha =0.5)ggplotly(p2)
p2=NULL
5. Add additional variables
# add population normalized (by 100,000) counts for each variablecv_states$per100k =as.numeric(format(round(cv_states$cases/(cv_states$population/100000),1),nsmall=1))cv_states$newper100k =as.numeric(format(round(cv_states$new_cases/(cv_states$population/100000),1),nsmall=1))
# add a naive_CFR variable = deaths / casescv_states = cv_states %>%mutate(naive_CFR =round((deaths*100/cases),2))# create a `cv_states_today` variablecv_states_today =subset(cv_states, date==max(cv_states$date))
II. Scatterplots
6. Explore scatterplots using plot_ly()
# pop_density vs. casescv_states_today %>%plot_ly(x =~pop_density, y =~cases, type ='scatter', mode ='markers', color =~state,size =~population, sizes =c(5, 70), marker =list(sizemode='diameter', opacity=0.5))
Warning: Ignoring 1 observations
Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
# filter out "District of Columbia"cv_states_today_filter <- cv_states_today %>%filter(state!="District of Columbia")# pop_density vs. cases after filteringcv_states_today_filter %>%plot_ly(x =~pop_density, y =~cases, type ='scatter', mode ='markers', color =~state,size =~population, sizes =c(5, 70), marker =list(sizemode='diameter', opacity=0.5))
Warning: Ignoring 1 observations
Warning: n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
Warning: n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
# pop_density vs. deathsper100kcv_states_today_filter %>%plot_ly(x =~pop_density, y =~deathsper100k,type ='scatter', mode ='markers', color =~state,size =~population, sizes =c(5, 70), marker =list(sizemode='diameter', opacity=0.5))
Warning: Ignoring 5 observations
Warning: n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
Warning: n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
# Adding hoverinfocv_states_today_filter %>%plot_ly(x =~pop_density, y =~deathsper100k,type ='scatter', mode ='markers', color =~state,size =~population, sizes =c(5, 70), marker =list(sizemode='diameter', opacity=0.5),hoverinfo ='text',text =~paste( paste(state, ":", sep=""), paste(" Cases per 100k: ", per100k, sep="") , paste(" Deaths per 100k: ", deathsper100k, sep=""), sep ="<br>")) %>%layout(title ="Population-normalized COVID-19 deaths (per 100k) vs. population density for US states",yaxis =list(title ="Deaths per 100k"), xaxis =list(title ="Population Density"),hovermode ="compare")
Warning: Ignoring 5 observations
Warning: n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
Warning: n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
7. Explore scatterplot trend interactively using ggplotly() and geom_smooth()
Warning: The following aesthetics were dropped during statistical transformation: size
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
There is an inverse correlation between population density and deathsper100k.
8. Multiple Line Chart
### FINISH CODE HERE# Line chart for naive_CFR for all states over time using `plot_ly()`plot_ly(cv_states, x =~date, y =~naive_CFR, color =~state, type ="scatter", mode ="lines")
Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
Returning the palette you asked for with that many colors
All of the states experienced an increase in naive_CFR over time. The states that had an increase in September only continue to increase over time.
### FINISH CODE HERE# Line chart for Florida showing new_cases and new_deaths togethercv_states %>%filter(state=="Florida") %>%plot_ly(x =~date, y =~new_cases, type ="scatter", mode ="lines", name ="New Cases") %>%add_trace(x =~date, y =~new_deaths, type ="scatter", mode ="lines", name ="New Deaths")
# I used add_trace because add_layer wasn't working# I also added names to the traces because it was confusing otherwise